Moving from serial CPU programming to GPU programming requires a paradigm shift: from element-wise iteration to block-based execution. Instead of treating data as a stream of scalars, we treat it as a collection of "blocks" that can be scheduled to saturate the hardware's bandwidth.
1. Memory-Bound vs. Compute-Bound
A kernel's bottleneck is determined by the ratio of math operations to memory accesses. Vector addition is typically memory-bound because it performs only one addition for every three memory operations (two loads, one store). The hardware spends far more time waiting on DRAM than actually computing.
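The ratio above can be made concrete as an arithmetic-intensity calculation. This is a back-of-the-envelope sketch, assuming float32 elements and no caching; the numbers are illustrative, not measured:

```python
# Arithmetic intensity of vector addition (out = x + y).
# Assumption: float32 (4 bytes/element), every access goes to DRAM.

def arithmetic_intensity(flops: int, bytes_moved: int) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

N = 1_000_000
flops = N                  # one add per element
bytes_moved = 3 * 4 * N    # two loads + one store, 4 bytes each

ai = arithmetic_intensity(flops, bytes_moved)
print(f"Arithmetic intensity: {ai:.4f} FLOP/byte")  # ~0.083 FLOP/byte
```

At roughly 0.08 FLOP/byte, vector addition sits far below the compute-to-bandwidth ratio of any modern GPU, which is why the memory bus, not the ALUs, sets its speed limit.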
2. The Role of BLOCK_SIZE
BLOCK_SIZE defines the granularity of parallelism. If it is too small, we cannot fill the GPU's wide execution lanes. A well-chosen size keeps enough work "in flight" to saturate the memory bus.
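How BLOCK_SIZE carves a 1-D problem into blocks can be sketched on the CPU. The function names below are illustrative (they mirror the Triton-style convention of a program id per block with out-of-bounds masking), not a real launch API:

```python
import math

def launch_grid(n_elements: int, block_size: int) -> int:
    """Number of blocks needed to cover n_elements (ceiling division)."""
    return math.ceil(n_elements / block_size)

def block_indices(pid: int, block_size: int, n_elements: int) -> list[int]:
    """Element indices handled by block `pid`, masked to stay in bounds."""
    start = pid * block_size
    return [i for i in range(start, start + block_size) if i < n_elements]

n = 1000
print(launch_grid(n, 128))        # 8 blocks
print(block_indices(7, 128, n))   # last block covers 896..999 (partial)
```

The mask in `block_indices` is the CPU analogue of the boundary check a GPU kernel needs whenever `n_elements` is not a multiple of BLOCK_SIZE.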
3. Hiding Latency Through Occupancy
Occupancy is the number of active blocks on the GPU. While not a goal in itself, it lets the scheduler switch to another block and keep computing while one block waits on a high-latency fetch from device memory.
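The latency-hiding effect can be captured with a toy analytical model (an assumption-laden sketch, not a real GPU simulator): each block waits `latency` cycles for memory, then computes for `compute` cycles, and the scheduler overlaps one block's wait with the others' compute:

```python
def time_per_block(latency: int, compute: int, resident_blocks: int) -> float:
    """Average cycles per block; the stall shrinks as more blocks are in flight."""
    # Other resident blocks' compute time can hide this block's memory wait.
    hidden = min(latency, (resident_blocks - 1) * compute)
    return latency - hidden + compute

L, C = 400, 20
print(time_per_block(L, C, 1))    # 420: no overlap, full stall
print(time_per_block(L, C, 21))   # 20: latency fully hidden
```

With enough resident blocks (here 21), the 400-cycle memory stall vanishes from the critical path; this is why occupancy matters even though no individual block runs any faster.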
4. Hardware Utilization
To maximize performance, we must align our BLOCK_SIZE with the GPU architecture's memory-coalescing rules, ensuring that consecutive threads access consecutive memory addresses.
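Coalescing can be illustrated by counting memory transactions. The 32-element transaction granularity below is an assumption (chosen to mirror a 128-byte line of float32 values); the point is the contrast between contiguous and strided access:

```python
TRANSACTION = 32  # elements per memory transaction (assumed granularity)

def transactions_needed(addresses: list[int]) -> int:
    """Distinct aligned transactions touched by one warp's accesses."""
    return len({addr // TRANSACTION for addr in addresses})

warp = range(32)
coalesced = [tid for tid in warp]       # thread i -> element i (contiguous)
strided = [tid * 32 for tid in warp]    # thread i -> element 32*i (scattered)

print(transactions_needed(coalesced))   # 1 transaction
print(transactions_needed(strided))     # 32 transactions
```

The same 32 loads cost either one transaction or thirty-two depending purely on the access pattern, a 32x difference in effective bandwidth under this model.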
QUESTION 1
For a kernel that adds two vectors ($out = x + y$), what is the most likely bottleneck on modern GPUs?
Arithmetic Throughput
Memory Bandwidth
Register Pressure
Shared Memory Latency
✅ Correct!
Vector addition involves very little math compared to the amount of data moved (3 memory ops per 1 add), making it memory-bound.
❌ Incorrect
Arithmetic throughput is rarely the bottleneck for simple element-wise operations like addition.
QUESTION 2
What is the primary purpose of 'Occupancy' in the GPU execution model?
To ensure every thread runs as fast as possible.
To hide memory latency by keeping work in flight.
To increase the clock speed of the compute units.
To reduce the power consumption of the HBM.
✅ Correct!
High occupancy allows the GPU to switch to other active threads while some wait for data from global memory.
❌ Incorrect
Occupancy doesn't change thread speed or clock frequency; it focuses on scheduler efficiency.
QUESTION 3
Which of the following describes 'Memory-Bound' behavior?
The GPU is waiting for the memory bus to deliver data.
The GPU has exhausted its available VRAM.
The kernel is performing too many complex floating-point operations.
The CPU cannot launch kernels fast enough.
✅ Correct!
Memory-bound kernels are limited by the speed of data transfer from DRAM/HBM to the registers.
❌ Incorrect
Exhausting VRAM is an Out-of-Memory error, not a 'memory-bound' performance bottleneck.
QUESTION 4
What happens if the BLOCK_SIZE is set too small?
The kernel will fail with a memory error.
The GPU fails to utilize its wide SIMD execution lanes.
The memory bandwidth increases significantly.
Register pressure becomes too high.
✅ Correct!
Small block sizes result in underutilization because the hardware's execution units expect many threads to work in parallel.
❌ Incorrect
Small block sizes actually reduce register pressure but hurt throughput.
QUESTION 5
In the logistics warehouse analogy, what represents the 'Blocks'?
The individual items.
The workers.
The organized pallets.
The delivery trucks.
✅ Correct!
Organizing items into pallets (Blocks) ensures efficient transport and processing by workers (Compute Units).
❌ Incorrect
The trucks represent the memory bus; the workers represent the compute units.
Case Study: Bottleneck Analysis
Identifying Kernel Constraints
You are profiling four kernels: a large Vector Addition kernel, a Deep Matrix Multiplication (GEMM) kernel, a tiny 4-element Vector Addition kernel, and a kernel that performs ReLU on a matrix. You need to categorize their bottlenecks based on hardware utilization theory.
Q
1. For each kernel (Vector Add, Matrix Multiply, 4-element Vector Add), decide whether the bottleneck is likely arithmetic throughput, memory bandwidth, or launch overhead.
Solution:
1. **Vector Addition**: Memory Bandwidth (low math-to-memory ratio).
2. **Deep Matrix Multiply**: Arithmetic Throughput (high $O(N^3)$ compute vs. $O(N^2)$ memory).
3. **4-element Vector Add**: Launch Overhead (the time to start the GPU kernel outweighs the tiny workload).
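The vector-add vs. GEMM distinction follows directly from their math-to-memory ratios. A rough comparison, assuming float32 and naive (uncached) traffic counts, with an illustrative matrix size:

```python
def flops_per_byte(flops: int, bytes_moved: int) -> float:
    """Arithmetic intensity: FLOPs per byte of memory traffic."""
    return flops / bytes_moved

N = 4096  # illustrative problem size

# Vector add: N adds, 3N float32 transfers (two loads + one store).
vec_add = flops_per_byte(N, 3 * 4 * N)

# GEMM: 2*N^3 FLOPs (multiply + add), three N x N float32 matrices of traffic.
gemm = flops_per_byte(2 * N**3, 3 * 4 * N**2)

print(f"vector add: {vec_add:.3f} FLOP/byte")   # ~0.083 -> memory-bound
print(f"GEMM:       {gemm:.1f} FLOP/byte")      # ~682.7 -> compute-bound
```

GEMM's intensity grows linearly with $N$ because its $O(N^3)$ compute is amortized over $O(N^2)$ data, while vector add's intensity is a small constant regardless of size; the 4-element case is so tiny that neither ratio matters and fixed launch cost dominates.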
Q
2. Determine the bottleneck for a ReLU operation on a large matrix.
Solution:
The bottleneck for **ReLU** on a matrix is **Memory Bandwidth**. Since the operation is a simple comparison ($\max(0, x)$), it is extremely computationally cheap, meaning performance is dictated by how fast the GPU can read the matrix from and write it back to global memory.
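The ReLU conclusion can be checked with a quick traffic estimate. Sizes are illustrative and caching is ignored; the point is that the byte count dwarfs the FLOP count:

```python
# Memory traffic estimate for ReLU over a large float32 matrix (assumed size).
rows, cols = 8192, 8192
bytes_per_elem = 4              # float32
n = rows * cols

read_bytes = n * bytes_per_elem     # load the input matrix
write_bytes = n * bytes_per_elem    # store the result
total_gb = (read_bytes + write_bytes) / 1e9

flops = n                           # one compare/select per element
print(f"traffic: {total_gb:.2f} GB, "
      f"{flops / (read_bytes + write_bytes):.3f} FLOP/byte")
```

At 0.125 FLOP/byte, ReLU sits in the same memory-bound regime as vector addition: the GPU moves over half a gigabyte to perform one trivial operation per element.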